LLMs in Production: Deploying the TitanML Takeoff server on AWS EC2
Getting large language models into production-quality deployments is a complicated and difficult process. At TitanML, our goal is to make this process faster, easier, and cheaper. Let's walk step by step through deploying a large language model on AWS using the AWS CLI.
Outline
- Setting up your AWS environment
- Baseline: running inference with PyTorch and the Transformers library
- Using Titan Takeoff to accelerate GPU inference
- Using Titan Takeoff to enable CPU inference
- Conclusions
Setting up your AWS environment
Amazon Web Services (AWS) is a cloud computing platform that offers a wide variety of services. In this tutorial, we'll use AWS to deploy a large language model (LLM). There are many ways of interacting with AWS to manage and deploy the required cloud resources. For this tutorial, we'll use the AWS Command Line Interface (CLI). If you prefer a graphical interface, all of the following commands can also be done through the AWS Console. If you've worked with AWS before, and know how to set up an EC2 instance that you can access over the internet, skip to here.
Install the AWS CLI
If you haven't done so already, install the AWS CLI. See here for more information on how to do so.
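Once installed, you can confirm that the CLI is available by checking its version (the exact version string will differ on your machine):
aws --version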
Configure the AWS CLI
We first need to configure the AWS CLI with our AWS credentials. If you haven't set up an AWS account, you can do so here. Once you've created an account, you'll need to create an access key; you can do so by following the instructions here. With the account and programmatic access keys1 created, set up the AWS CLI with your credentials by running the following:
aws configure
This command will prompt you for your AWS access key ID, secret access key, default region name, and default output format.
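The prompt sequence looks roughly like the following; the values shown here are placeholders, so substitute your own access key, secret, and preferred region:
AWS Access Key ID [None]: AKIA................
AWS Secret Access Key [None]: ....................
Default region name [None]: us-east-1
Default output format [None]: json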
EC2
Elastic Compute Cloud (EC2) is Amazon's service for launching virtual machines (VMs) on the cloud. It offers a wide variety of instance (VM) types, from small, general purpose instances to large, GPU-accelerated instances. EC2 instances are characterized by many different factors. For our purposes, the most important are the number of virtualized CPUs (vCPUs), the available RAM, and the accelerator type. For a nice comparison of the different instance types, see here.
For the GPU-enabled VM in this tutorial, we'll be using the p3.2xlarge instance type, which has 1 GPU, 8 vCPUs, and 61GB of RAM. For the CPU-only VM, we'll be using the c5.2xlarge instance type, which has 8 vCPUs and 16GB of RAM.
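If you'd like to confirm these specifications from the CLI, something like the following should work (the JMESPath query is an assumption about the describe-instance-types output shape; adjust it if your CLI version reports fields differently):
aws ec2 describe-instance-types --instance-types p3.2xlarge c5.2xlarge --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' --output table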
Creating and connecting to an EC2 instance
Before deploying our EC2 instance, we need to tell AWS how we want to connect to it. This means two things: we need to tell the instance how we're going to authenticate, and we need to configure the instance's firewall to allow us to connect to it.
Start by creating a key pair that we'll use with SSH to authenticate our connection to the instance. The following CLI commands create a key pair called MyKeyPair, save the private key locally, and give the file permissions that only allow you to read it.
Secure Shell (SSH) is a way of connecting to remote computers from your local machine. We use it here to access our EC2 instance. For more information, see the Wikipedia page. Another option for accessing EC2 instances (in the browser) is EC2 Instance Connect.
aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
chmod 400 MyKeyPair.pem
Key pair names must be unique within your AWS account and region: if you see errors about the key pair already existing, try using a different name. Remember to use that new name in any subsequent commands.
Make sure you save the generated MyKeyPair.pem file and keep it secure. You'll need this file to SSH into your EC2 instances.
Once you've created a key pair, you'll need to create a security group. A security group is the AWS method for configuring a VM's firewall. We'll create a security group that allows SSH connections (port 22) and HTTP(s) connections (ports 80 & 443).
Throughout this tutorial, we'll be using environment variables to store information that we'll need to use later. Environment variables only exist in the terminal session that you create them in. Make sure not to exit your terminal session!
export SECURITY_GROUP_ID=$(aws ec2 create-security-group --group-name my-sg --description "My security group" --query 'GroupId' --output text)
aws ec2 authorize-security-group-ingress --group-id $SECURITY_GROUP_ID --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id $SECURITY_GROUP_ID --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id $SECURITY_GROUP_ID --protocol tcp --port 443 --cidr 0.0.0.0/0
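To double-check that the rules were applied, you can describe the security group (the query path assumes the standard describe-security-groups output shape):
aws ec2 describe-security-groups --group-ids $SECURITY_GROUP_ID --query 'SecurityGroups[0].IpPermissions[].[IpProtocol,FromPort,ToPort]' --output table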
Now that the networking is configured, we can launch the instance itself. We're going to use a p3.2xlarge instance so that we have a GPU attached.
At the time of writing, GPU enabled instances require you to request a limit increase from AWS, and GPU access in the major cloud providers is highly constrained. The Titan inference server is designed to work with both GPU and CPU instances, so you can use a CPU instance if you don't have access to a GPU instance. See the later section for more information.
AMI stands for Amazon Machine Image. It's a snapshot of a virtual machine (VM) that can be used to create new VMs. We'll use Amazon's deep learning AMI as the image from which we'll start the VM. This image has Python, Docker, and NVIDIA drivers pre-installed (as well as many common deep learning libraries). The following command retrieves the AMI ID and stores it in the AMI_ID environment variable.
export AMI_ID=$(aws ec2 describe-images --owners amazon --filters 'Name=name,Values=Deep Learning AMI (Ubuntu 18.04) Version ??.?' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text)
Once we have the ID of the AMI, we can start the instance.
export INSTANCE_ID=$(aws ec2 run-instances --image-id $AMI_ID --count 1 --instance-type p3.2xlarge --key-name MyKeyPair --security-group-ids $SECURITY_GROUP_ID --block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=200} --query 'Instances[0].InstanceId' --output text)
By default, the instance is assigned a public IP address that can change. To make it easier to connect to the instance, we'll assign it an Elastic IP address. Elastic IP addresses persist even if the instance is stopped and restarted, so you won't have to look up a new address each time.
export ALLOCATION_ID=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
export ASSOCIATION_ID=$(aws ec2 associate-address --instance-id $INSTANCE_ID --allocation-id $ALLOCATION_ID --query 'AssociationId' --output text)
export ELASTIC_IP=$(aws ec2 describe-addresses --filters "Name=association-id,Values=$ASSOCIATION_ID" --query 'Addresses[0].PublicIp' --output text)
If you see a result like the following, wait a few seconds (your instance is booting) and try again.
An error occurred (InvalidInstanceID) when calling the
AssociateAddress operation: The pending instance '...'
is not in a valid state for this operation.
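Alternatively, you can block until the instance is ready before associating the address by using the CLI's built-in waiter:
aws ec2 wait instance-running --instance-ids $INSTANCE_ID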
In this section, we've created four AWS resources: an EC2 instance, a security group, a key pair, and an Elastic IP address. These resources are bound together. The instance, security group, and Elastic IP identifiers are stored in the environment variables $INSTANCE_ID, $SECURITY_GROUP_ID, and $ELASTIC_IP, and the key pair is saved locally as MyKeyPair.pem. We'll use these identifiers in the next section to configure our EC2 instance.
To see the IDs for the objects we've created, run the following
echo "Instance ID: $INSTANCE_ID"
echo "Security Group ID: $SECURITY_GROUP_ID"
echo "Key Pair Name: $KEY_PAIR_NAME"
echo "Elastic IP: $ELASTIC_IP"
Connecting to your instance
Once the instance is running, you can SSH into it. We suggest doing this from another terminal window, so that you have a shell on each machine open at the same time. In your original terminal, run
echo $ELASTIC_IP
to read out the value of the $ELASTIC_IP environment variable. Then, in your new terminal, run
export ELASTIC_IP=<value of $ELASTIC_IP>
to set the environment variable there, and connect with:
ssh -i MyKeyPair.pem ubuntu@$ELASTIC_IP
Baseline: running inference with PyTorch and the Transformers library
The first thing you might try is running inference with your model using PyTorch and the Hugging Face Transformers library. These libraries are great for training, but we'll see that we can do much better for inference.
At your AWS instance's command prompt, run the following command to install HuggingFace Transformers and PyTorch, and the dependencies required to build a simple wrapper server:
pip install torch transformers einops fastapi uvicorn
Once the dependencies have finished installing, we can write a simple server that will run inference on our model.
This implementation uses the FastAPI library to create a simple HTTP server, with a single endpoint that returns the model's output, and the number of generated tokens.
Save the following in a file called main.py
from fastapi import FastAPI
import torch
import transformers

app = FastAPI()

# use the falcon 7b model as an example
name = 'tiiuae/falcon-7b-instruct'

# trust_remote_code is required for falcon models, which contain custom
# code that is not in huggingface's transformers library.
# see the [model card](https://huggingface.co/tiiuae/falcon-7b-instruct) for more information.
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.float16,  # Load model weights in float16
    trust_remote_code=True
).to('cuda')


@app.get("/generate/")
async def generate_response(prompt: str):
    with torch.autocast('cuda', dtype=torch.float16):
        inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to('cuda')
        outputs = model.generate(**inputs, max_new_tokens=100)
        num_generated_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
        response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        return {"response": response, "generated_tokens": num_generated_tokens}
A few important concepts are included here. First, we're loading the model weights in the torch.float16 datatype, which halves the memory needed compared to the default float32. Without this, loading the model would cause a GPU out-of-memory error2.
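As a rough back-of-the-envelope check (weights only, ignoring the runtime and activation memory), the arithmetic looks like this:
echo $(( 7 * 4 ))   # ~GB of weights for 7B parameters in float32 (4 bytes each)
echo $(( 7 * 2 ))   # ~GB of weights for 7B parameters in float16 (2 bytes each)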
To launch your new LLM API, run the following command3:
sudo $(which uvicorn) main:app --host 0.0.0.0 --port 80
Once the model has finished downloading, the server should spin up. Now our ML model is deployed, and available to the world! Let's see how it's doing by making a call to it. From your local PC (not the AWS instance, although that will work too), run the following curl command4
time curl -X 'GET' \
"http://${ELASTIC_IP}/generate/?prompt=List%2030%20things%20to%20do%20in%20London" \
-H 'accept: application/json'
After the model returns its (hopefully useful) response with information on things to do in London, the time command should print something like the following:
real 0m7.734s
user 0m0.009s
sys 0m0.007s
This is a great start, but it has some major problems. The instance we've used (p3.2xlarge) is very expensive: $3.06 per hour, because of its powerful GPU. And even on that expensive GPU, with the model running in float16, inference is still slow. We can definitely do better.
Taking It a Step Further: The Titan Inference Server
The Titan Takeoff Inference Server is a tool that allows you to deploy large language models rapidly anywhere. It enables two things:
- If we want superfast inference, we can keep the GPU, and get massive speedups.
- We can run our model on a CPU, which is much cheaper than a GPU.
Using Titan Takeoff to speed up GPU inference
Let's start with speeding up our GPU inference.
On the same AWS machine, run
pip install titan-iris
Then, run the following command
iris takeoff --model tiiuae/falcon-7b-instruct --port 80 --device cuda
You'll be taken to a webpage to sign up. Don't worry - the sign up is free. Once the command is running, the model and server will start downloading. The output of the command will give you a command to tail the logs from the server. Run this command to see the progress of the optimization process.
Once the server has booted, run the following curl command to send a request to the model:
time curl -X 'POST' \
"http://${ELASTIC_IP}/generate" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"text": "List 30 things to do in London",
"generate_max_length": 100,
"sampling_topk": 1,
"sampling_topp": 1,
"sampling_temperature": 1
}'
real 0m2.067s
user 0m0.011s
sys 0m0.007s
You can see the interface is slightly different - the parameters are passed in the JSON body, and we use a POST request.
For more information on the Titan server's API, see the OpenAPI spec at http://${ELASTIC_IP}/docs.
The response should be the same as before, but much faster. For the sake of building responsive and interactive apps, the server also exposes an endpoint that streams tokens back from the server one by one. This way, the user can see the model generating the response in real time.
To test it out, use the /generate_stream endpoint, like so. Make sure to add the -N flag to curl, so that it doesn't buffer the response.
time curl -X 'POST' \
"http://${ELASTIC_IP}/generate_stream" \
-N \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"text": "List 30 things to do in London",
"generate_max_length": 100,
"sampling_topk": 1,
"sampling_topp": 1,
"sampling_temperature": 1
}'
Compression: running on a small GPU, or on a CPU
We've explored the first option: speeding up inference on a GPU.
It wasn't obvious from the response, but the model that was running was actually substantially smaller and more resource-efficient, as well as being faster. If you look at the GPU memory utilization (by running the bash command nvidia-smi) while the model is running, you'll see that the model is only using ~7GB of GPU memory. The same command run on the original model would show it using ~13GB of GPU memory (in float16), or ~26GB (in float32).
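For a quick, machine-readable view of memory usage while the server is running, you can use nvidia-smi's query mode (standard flags; the exact output formatting may vary by driver version):
nvidia-smi --query-gpu=memory.used,memory.total --format=csv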
A table of each model's memory usage, and the GPUs that would be able to run it, is shown below.
| Model | Memory Usage | Possible GPUs |
|---|---|---|
| falcon-7b-instruct (fp32) | 26GB | A100, A6000 |
| falcon-7b-instruct (fp16) | 13GB | RTX 3090 |
| falcon-7b-instruct (titan-optimized) | 7GB | RTX 3050 |
Using Titan Takeoff to enable CPU inference
Run the following command to start a CPU-only instance.
export CPU_INSTANCE_ID=$(aws ec2 run-instances --image-id ami-0271ce88f6c03e149 --count 1 --instance-type c5.2xlarge --key-name MyKeyPair --security-group-ids $SECURITY_GROUP_ID --block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=200} --query 'Instances[0].InstanceId' --output text)
Allocate an Elastic IP address and associate it with the new instance.
export CPU_ALLOCATION_ID=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)
export ASSOCIATION_ID=$(aws ec2 associate-address --instance-id $CPU_INSTANCE_ID --allocation-id $CPU_ALLOCATION_ID --query 'AssociationId' --output text)
export ELASTIC_IP=$(aws ec2 describe-addresses --filters "Name=association-id,Values=$ASSOCIATION_ID" --query 'Addresses[0].PublicIp' --output text)
Connect to your new instance:
ssh -i MyKeyPair.pem ubuntu@$ELASTIC_IP
Remember to install the iris package
pip install titan-iris
Now, use iris takeoff to launch the server on the CPU instance.
iris takeoff --model tiiuae/falcon-7b-instruct --port 80 --device cpu
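Once the CPU server has booted, send the same request as before, this time against the CPU instance's Elastic IP:
time curl -X 'POST' \
"http://${ELASTIC_IP}/generate" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"text": "List 30 things to do in London",
"generate_max_length": 100,
"sampling_topk": 1,
"sampling_topp": 1,
"sampling_temperature": 1
}'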
real 0m15.735s
user 0m0.011s
sys 0m0.005s
Let's compare costs. The p3.2xlarge instance costs $3.06 per hour, while the c5.2xlarge instance costs $0.384 per hour. The CPU inference is ~8x cheaper!
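As a quick sanity check on that ratio, using the hourly prices quoted above:
echo "scale=1; 3.06 / 0.384" | bc   # roughly 8x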
The Takeoff Inference Server has reduced the latency gap between CPU and GPU inference, and made CPU-only inference a viable option for many use cases. In addition, token streaming means that for many applications, the time to first token is much lower than the time to the full response.
Conclusions
This blog post demonstrates the potential of the Titan Takeoff Inference Server, not only as a tool for deploying LLMs but also as a cost-efficient solution. We hope you found it useful and look forward to seeing what you'll build with this knowledge!
If you have any questions, comments, or feedback on the Titan Takeoff server, please reach out to us on our Discord server. For help with LLM deployment in general, or to sign up for the pro version of the Titan Takeoff Inference Server, with features like automatic batching, multi-GPU inference, monitoring, authorization, and more, please reach out at hello@titanml.co.
Cleanup
Remember to clean up all the AWS resources that were created during this tutorial. To terminate your EC2 instances, run the following
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
aws ec2 terminate-instances --instance-ids $CPU_INSTANCE_ID
To release the Elastic IP addresses, run the following
aws ec2 release-address --allocation-id $ALLOCATION_ID
aws ec2 release-address --allocation-id $CPU_ALLOCATION_ID
And to delete the security group we created, run the following
aws ec2 delete-security-group --group-id $SECURITY_GROUP_ID
The security group is shared by both instances and only needs to be deleted once; make sure the termination and release steps above cover both the GPU and CPU instances.
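Finally, if you no longer need the key pair created at the start of this tutorial, you can delete it from AWS and remove the local private key file (substitute your key pair name if you chose a different one):
aws ec2 delete-key-pair --key-name MyKeyPair
rm MyKeyPair.pem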
Footnotes
1. There are other options besides long-lived access keys that might be more appropriate for your use case. See here for more information. ↩
2. The model has 7B parameters. The float32 datatype uses 4 bytes per parameter, so the memory footprint of the 7B-parameter model is at least 28GB. In practice, we'll also need to load the torch runtime, and computation will allocate even more memory. Without loading in float16, we'd be unable to run the model on this instance. ↩
3. sudo is required because we're binding to a 'low' port (80), which is necessary because we're serving the model over the internet. Because sudo by default doesn't inherit the environment variables from the enclosing shell, we also have to give the absolute path to the uvicorn executable with $(which uvicorn). For local development, uvicorn main:app --port 8000 will work, and the model will be available from the EC2 instance at http://localhost:8000. ↩
4. The time command means that, after the request is finished, the time it took to complete the request will be printed. ↩